upgrades: add checkpointing for raftAppliedIndexTermMigration
#85074
Forward-port of #84909, for posterity.
The `raftAppliedIndexTermMigration` upgrade migration could be
unreliable. It iterates over all ranges and runs a `Migrate` request
which must be applied on all replicas. However, if any ranges merge or
replicas are unavailable, the migration fails and starts over from the
beginning. In large clusters with many ranges, this meant that it might
never complete.
This patch makes the upgrade more robust by retrying each `Migrate`
request 5 times, and checkpointing the progress after every fifth batch
(1000 ranges), allowing resumption on failure. At some point this should
be made part of the migration infrastructure.
NB: This fix was initially submitted for 22.1, and even though the
migration will be removed for 22.2, it is forward-ported for posterity.
Release note: None